Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.
Produced with R version 3.2.1 and pomp version 0.71.1.
POMP model schematic, showing dependence among model variables. State variables, \(X\), at time \(t\) depend only on state variables at the previous timestep. Measurements, \(Y\), at time \(t\) depend only on the state at that time. Formally, \({\mathbb{P}\left[{X_t|X,Y}\right]}={\mathbb{P}\left[{X_t|X_{t-1}}\right]}\) and \({\mathbb{P}\left[{Y_t|X,Y}\right]}={\mathbb{P}\left[{Y_t|X_{t}}\right]}\) for all \(X=\{X_1,X_2,\dots,X_T\}\), \(Y=\{Y_1,Y_2,\dots,Y_T\}\), and \(t=1,\dots,T\).
Write \(X_n=X(t_n)\) and \(X_{0:N}=(X_0,\dots,X_N)\).
Let \(Y_n\) be a random variable modeling the observation at time \(t_n\).
The one-step transition density, \(f_{X_n|X_{n-1}}(x_n|x_{n-1};\theta)\), together with the measurement density, \(f_{Y_n|X_n}(y_n|x_n;\theta)\) and the initial density, \(f_{X_0}(x_0;\theta)\), specify the entire joint density via \[f_{X_{0:N},Y_{1:N}}(x_{0:N},y_{1:N};\theta) = f_{X_0}(x_0;\theta)\,\prod_{n=1}^N\!f_{X_n | X_{n-1}}(x_n|x_{n-1};\theta)\,f_{Y_n|X_n}(y_n|x_n;\theta).\]
The marginal density for sequence of measurements, \(Y_{1:N}\), evaluated at the data, \(y_{1:N}^*\), is \[{\mathscr{L}}(\theta) = f_{Y_{1:N}}(y^*_{1:N};\theta)=\int\!f_{X_{0:N},Y_{1:N}}(x_{0:N},y^*_{1:N};\theta)\, dx_{0:N}.\]
Lets’ begin with a special case. Suppose that the unobserved state process is deterministic. That is, \(X_{n}=x_n(\theta)\) is a known function of \(\theta\) for each \(n\). What is the likelihood?
Since the probability of the observation, \(Y_n\), depends only on \(X_n\) and \(\theta\), and since, in particular \(Y_{m}\) and \(Y_{n}\) are independent given \(X_{m}\) and \(X_{n}\), we have \[{\mathscr{L}}(\theta) = \prod_{n} f_{Y_n|X_n}(y_n^*;x_n(\theta),\theta)\] or \[\log{\mathscr{L}}(\theta) = \sum_{n} \log f_{Y_n|X_n}(y_n^*;x_n(\theta),\theta).\]
We can imagine using Monte Carlo integration for computing the likelihood of a state space model, \[{\mathscr{L}}(\theta)={\mathbb{P}\left[{y_{1:T}|\theta}\right]}=\sum_{x_1}\cdots\sum_{x_T}\!\prod_{t=1}^{T}\!{\mathbb{P}\left[{y_t|x_t,\theta}\right]}\,{\mathbb{P}\left[{x_t|x_{t-1},\theta}\right]}.\] Specifically, if we have some probabilistic means of proposing trajectories for the unobserved state process, then we could just generate a large number of these and approximate \({\mathscr{L}}(\theta)\) by its Monte Carlo estimate. Specifically, we generate \(N\) trajectories of length \(T\), \(x_{t,k}\), \(k=1\,\dots,N\), \(t=1,\dots,T\). Let \(w_k\) denote the probability of proposing trajectory \(k\). For each trajectory, we compute the likelihood of that trajectory \[\ell_k(\theta)=\prod_{t=1}^{T} {\mathbb{P}\left[{y_t|x_{t,k},\theta}\right]}\,{\mathbb{P}\left[{x_{t,k}|x_{t-1,k},\theta}\right]}\] Then by the Monte Carlo theorem, we have \[{\mathscr{L}}(\theta) \approx \frac{1}{N}\,\sum_{k=1}^{N}\!\frac{\ell_k(\theta)}{w_k}.\]
How shall we choose our trajectories? One idea would be to choose them so as to simplify the computation. If we choose them such that \[w_k=\prod_{t=1}^{T} {\mathbb{P}\left[{x_{t,k}|x_{t-1,k},\theta}\right]},\] then we have \[{\mathscr{L}}(\theta) \approx \frac{1}{N}\,\sum_{k=1}^{N} \frac{\ell_k(\theta)}{w_k} = \frac{1}{N}\,\sum_{k=1}^{N}\!\frac{\prod_{t=1}^{T} {\mathbb{P}\left[{y_t|x_{t,k},\theta}\right]}\,{\mathbb{P}\left[{x_{t,k}|x_{t-1,k},\theta}\right]}}{\prod_{t=1}^{T} {\mathbb{P}\left[{x_{t,k}|x_{t-1,k},\theta}\right]}} = \frac{1}{N}\,\sum_{k=1}^{N}\!\prod_{t=1}^{T} {\mathbb{P}\left[{y_t|x_{t,k},\theta}\right]}\]
This implies that if we generate trajectories by simulation, all we have to do is compute the likelihood of the data with given each trajectory and average.
Let’s go back to the boarding school influenza outbreak to see what this would look like. Let’s reconstruct the toy SIR model we were working with.
baseurl <- "http://kinglab.eeb.lsa.umich.edu/SBIED/"
url <- paste0(baseurl,"data/bsflu_data.txt")
bsflu <- read.table(url)
sir_step <- Csnippet("
double dN_SI = rbinom(S,1-exp(-beta*I/N*dt));
double dN_IR = rbinom(I,1-exp(-gamma*dt));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
H += dN_IR;
")
sir_init <- Csnippet("
S = nearbyint(N)-1;
I = 1;
R = 0;
H = 0;
")
dmeas <- Csnippet("lik = dbinom(B,H,rho,give_log);")
rmeas <- Csnippet("B = rbinom(H,rho);")
pomp(bsflu,times="day",t0=0,
rprocess=euler.sim(sir_step,delta.t=1/5),
initializer=sir_init,rmeasure=rmeas,dmeasure=dmeas,
zeronames="H",statenames=c("H","S","I","R"),
paramnames=c("beta","gamma","rho","N")) -> sir
plot(sir,main="",var="B")
Let’s generate a bunch of simulated trajectories at some particular point in parameter space.
simulate(sir,params=c(beta=2,gamma=1,rho=0.5,N=2600),
nsim=60,states=TRUE) -> x
matplot(time(sir),t(x["H",,]),type='l',lty=1,
xlab="time",ylab="H",bty='l')
We can use the function dmeasure to evaluate the log likelihood of the data given the states, the model, and the parameters:
ell <- dmeasure(sir,y=obs(sir),x=x,times=time(sir),log=TRUE,
params=c(beta=2,gamma=1,rho=0.5,N=2600))
dim(ell)
## [1] 60 14
According to the equation above, we should sum up the log likelihoods across time:
ell <- apply(ell,1,sum); ell
## [1] -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## [8] -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## [15] -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## [22] -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## [29] -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## [36] -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## [43] -Inf -331.7429 -Inf -Inf -Inf -Inf -Inf
## [50] -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## [57] -Inf -Inf -Inf -Inf
The variance of these estimates is very high and therefore the estimated likelihood is very imprecise. We are going to need very many simulations to get an estimate of the likelihood sufficiently precise to be of any use in parameter estimation or model selection. Moreover, even one simulation which is strictly incompatible with the data implies a likelihood of zero!
What’s the problem? Essentially, far too many of the trajectories don’t pass near the data. Moreover, once a trajectory diverges from the data, it almost never comes back. This is a consequence of the fact that we are proposing trajectories in a way that is completely unconditional on the data. The problem will only get worse with longer data sets!
We can arrive at a more efficient algorithm by factorizing the likelihood in a different way. \[{\mathscr{L}}(\theta)={\mathbb{P}\left[{y^*_{1:T}|\theta}\right]} =\prod_{t}\,{\mathbb{P}\left[{y^*_t|y^*_{1:t-1},\theta}\right]} =\prod_{t}\,\sum_{x_{t}}\,{\mathbb{P}\left[{y^*_t|x_t,\theta}\right]}\,{\mathbb{P}\left[{x_t|y^*_{1:t-1},\theta}\right]}\] Now the Markov property gives us that \[{\mathbb{P}\left[{x_t|y^*_{1:t-1},\theta}\right]} = \sum_{x_{t-1}}\,{\mathbb{P}\left[{x_t|x_{t-1},\theta}\right]}\,{\mathbb{P}\left[{x_{t-1}|y^*_{1:t-1},\theta}\right]}\] and Bayes’ theorem tells us that \[{\mathbb{P}\left[{x_t|y^*_{1:t},\theta}\right]} = {\mathbb{P}\left[{x_t|y_t,y^*_{1:t-1},\theta}\right]} =\frac{{\mathbb{P}\left[{y^*_{t}|x_{t},\theta}\right]}\,{\mathbb{P}\left[{x_{t}|y^*_{1:t-1},\theta}\right]}}{\sum_{x_{t}}\,{\mathbb{P}\left[{y^*_{t}|x_{t},\theta}\right]}\,{\mathbb{P}\left[{x_{t}|y^*_{1:t-1},\theta}\right]}}.\]
This suggests that we keep track of two key distributions. We’ll refer to the distribution of \(x_t | y^*_{1:t-1}\) as the prediction distribution at time \(t\) and the distribution of \(x_{t} | y^*_{1:t}\) as the filtering distribution at time \(t\).
Let’s use Monte Carlo techniques to estimate the sums. Suppose \(x_{t-1,k}^{F}\), \(k=1,\dots,N\) is a set of points drawn from the filtering distribution at time \(t-1\). We obtain a sample \(x_{t,k}^{P}\) of points drawn from the prediction distribution at time \(t\) by simply simulating the process model: \[x_{t,k}^{P} \sim \mathrm{process}(x_{t-1,k}^{F},\theta), \qquad k=1,\dots,N.\] Having obtained \(x_{t,k}^{P}\), we obtain a sample of points from the filtering distribution at time \(t\) by resampling from \(x_{t,k}^{P}\) with weights \({\mathbb{P}\left[{y^*_{t}|x_{t},\theta}\right]}\). The Monte Carlo theorem tells us, too, that the conditional likelihood \[\ell_t(\theta) = {\mathbb{P}\left[{y^*_t|y^*_{1:t-1},\theta}\right]} = \sum_{x_{t}}\,{\mathbb{P}\left[{y^*_{t}|x_{t},\theta}\right]}\,{\mathbb{P}\left[{x_{t}|y^*_{1:t-1},\theta}\right]} \approx \frac{1}{N}\,\sum_k\,{\mathbb{P}\left[{y^*_{t}|x_{t,k}^{P},\theta}\right]}.\] We can iterate this procedure through the data, one step at a time, alternately simulating and resampling, until we reach \(t=T\). The full log likelihood is then approximately \[\log{{\mathscr{L}}(\theta)} = \sum_t \log{\ell_t(\theta)}.\]
This is known as the sequential Monte Carlo algorithm or the particle filter. Key references include Kitagawa (1987), Arulampalam et al. (2002) and the book by Doucet et al. (2001).
Here, we’ll get some practical experience with the particle filter, and the likelihood function, in the context of our influenza-outbreak case study.
sims <- simulate(sir,params=c(beta=2,gamma=1,rho=0.8,N=2600),nsim=20,
as.data.frame=TRUE,include.data=TRUE)
ggplot(sims,mapping=aes(x=time,y=B,group=sim,color=sim=="data"))+
geom_line()+guides(color=FALSE)
In pomp, the basic particle filter is implemented in the command pfilter. We must choose the number of particles to use by setting the Np argument.
pf <- pfilter(sir,Np=5000,params=c(beta=2,gamma=1,rho=0.8,N=2600))
logLik(pf)
## [1] -80.42943
We can run a few particle filters to get an estimate of the Monte Carlo variability:
pf <- replicate(10,pfilter(sir,Np=5000,params=c(beta=2,gamma=1,rho=0.8,N=2600)))
ll <- sapply(pf,logLik); ll
## [1] -78.92451 -80.96060 -83.91316 -85.23849 -79.95454 -77.56114 -76.10271
## [8] -80.56887 -82.69699 -79.48503
logmeanexp(ll,se=TRUE)
## se
## -78.091617 2.274851
Note that we’re careful here to counteract Jensen’s inequality. The particle filter gives us an unbiased estimate of the likelihood, not of the log-likelihood.
To get an idea of what the likelihood surface looks like in the neighborhood of the default parameter set supplied by sir, we can construct some likelihood slices. We’ll make slices in the \(\beta\) and \(\gamma\) directions. Both slices will pass through the default parameter set.
sliceDesign(
c(beta=2,gamma=1,rho=0.8,N=2600),
beta=rep(seq(from=0.5,to=4,length=40),each=3),
gamma=rep(seq(from=0.5,to=2,length=40),each=3)) -> p
require(foreach)
require(doMC)
registerDoMC(cores=5) ## number of cores on this machine
set.seed(998468235L,kind="L'Ecuyer")
mcopts <- list(preschedule=FALSE,set.seed=TRUE)
foreach (theta=iter(p,"row"),.combine=rbind,
.inorder=FALSE,.options.multicore=mcopts) %dopar%
{
pfilter(sir,params=unlist(theta),Np=5000) -> pf
theta$loglik <- logLik(pf)
theta
} -> p
Note that we’ve used the foreach package with the multicore backend (doMC) to parallelize these computations. To ensure that we have high-quality random numbers in each parallel R session, we use a parallel random number generator (kind="L'Ecuyer", .options.multicore=list(set.seed=TRUE)).
foreach (v=c("beta","gamma")) %do%
{
x <- subset(p,slice==v)
plot(x[[v]],x$loglik,xlab=v,ylab="loglik")
}
Add likelihood slices along the \(\rho\) direction.
Slices offer a very limited perspective on the geometry of the likelihood surface. With just two parameters, we can evaluate the likelihood at a grid of points and visualize the surface directly.
expand.grid(beta=seq(from=1,to=4,length=50),
gamma=seq(from=0.7,to=3,length=50),
rho=0.8,
N=2600) -> p
foreach (theta=iter(p,"row"),.combine=rbind,
.inorder=FALSE,.options.multicore=mcopts) %dopar%
{
pfilter(sir,params=unlist(theta),Np=5000) -> pf
theta$loglik <- logLik(pf)
theta
} -> p
pp <- mutate(p,loglik=ifelse(loglik>max(loglik)-100,loglik,NA))
ggplot(data=pp,mapping=aes(x=beta,y=gamma,z=loglik,fill=loglik))+
geom_tile(color=NA)+
geom_contour(color='black',binwidth=3)+
scale_fill_gradient()+
labs(x=expression(beta),y=expression(gamma))
expand.grid(beta=seq(from=1,to=3,length=50),
gamma=1,
rho=0.8,
N=seq(from=1600,to=3000,length=50)) -> p
foreach (theta=iter(p,"row"),.combine=rbind,
.inorder=FALSE,.options.multicore=mcopts) %dopar%
{
pfilter(sir,params=unlist(theta),Np=5000) -> pf
theta$loglik <- logLik(pf)
theta
} -> p
pp <- mutate(p,loglik=ifelse(loglik>max(loglik)-100,loglik,NA))
ggplot(data=pp,mapping=aes(x=beta,y=N,z=loglik,fill=loglik))+
geom_tile(color=NA)+
geom_contour(color='black',binwidth=3)+
scale_fill_gradient()+
labs(x=expression(beta),y=expression(N))
Clearly, the default parameter set is not particularly close to the MLE. One way to find the MLE is to try optimizing the estimated likelihood directly. There are many optimization algorithms to choose from, and many implemented in R.
Three issues arise immediately.
Np larger, but we cannot make it go away. If we use a deterministic optimizer (i.e., one that assumes the objective function is evaluated deterministically), then we must control this variability somehow. For example, we can fix the seed of the pseudo-random number generator. A side effect will be that the objective function becomes jagged, marked by many small local knolls and pits. Alternatively, we can use a stochastic optimization algorithm, with which we will be only be able to obtain estimates of our MLE. This is the trade-off between a noisy and a rough objective function.Here, let’s opt for deterministic optimization of a rough function. We’ll try using optim’s default method: Nelder-Mead, fixing the random-number generator seed to make the likelihood calculation deterministic. Since Nelder-Mead is an unconstrained optimizer, we must transform the parameters. The following Csnippets encode an appropriate transformation and its inverse, and introduce them into the pomp object.
toEst <- Csnippet("
Tbeta = log(beta);
Tgamma = log(gamma);
Trho = logit(rho);
")
fromEst <- Csnippet("
Tbeta = exp(beta);
Tgamma = exp(gamma);
Trho = expit(rho);
")
pomp(sir,toEstimationScale=toEst,
fromEstimationScale=fromEst,
paramnames=c("beta","gamma","rho")) -> sir
Let’s fix a reference point in parameter space and insert these parameters into the pomp object:
coef(sir) <- c(beta=2,gamma=1,rho=0.8,N=2600)
The following constructs a function returning the negative log likelihood of the data at a given point in parameter space.
neg.ll <- function (par, est, ...) {
## parameters to be estimated are named in 'est'
allpars <- coef(sir,transform=TRUE)
allpars[est] <- par
try(
pfilter(sir,params=partrans(sir,allpars,dir="fromEst"),
Np=1000,seed=3488755L,...)
) -> pf
if (inherits(pf,"try-error")) {
1e10 ## a big, bad number
} else {
-logLik(pf)
}
}
Now we call optim to minimize this function:
## use Nelder-Mead with fixed RNG seed
fit <- optim(
par=c(log(1), log(2), log(0.8)),
est=c("gamma","beta","rho"),
fn=neg.ll,
method="Nelder-Mead",
control=list(maxit=400,trace=0)
)
mle <- sir
coef(mle,c("gamma","beta","rho"),transform=TRUE) <- fit$par
coef(mle,c("gamma","beta","rho")) ## point estimate
## gamma beta rho
## 0.4632256 2.2172983 0.5680403
fit$val
## [1] 77.22541
simulate(mle,nsim=8,as.data.frame=TRUE,include=TRUE) -> sims
lls <- replicate(n=5,logLik(pfilter(mle,Np=5000)))
ll <- logmeanexp(lls,se=TRUE); ll
## se
## -82.191039 0.525833
Some simulations at these parameters are shown next:
ggplot(data=sims,mapping=aes(x=time,y=B,group=sim,color=sim=="data"))+
geom_line()
Construct likelihood slices through the MLE we just found.
Evaluate the likelihood at points on a grid lying in a 2D slice through the MLE we found above. Each group should choose a different slice. Afterward, we’ll compare results across groups.
The search of parameter space we conducted above was local. It is possible that we found a local maximum, but that other maxima exist with higher likelihoods. Conduct a more thorough search by initializing the Nelder-Mead starting points across a wider region of parameter space. Do you find any other local maxima?
Arulampalam, M. S., S. Maskell, N. Gordon, and T. Clapp. 2002. A tutorial on particle filters for online nonlinear, non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50:174–188.
Doucet, A., N. de Freitas, and N. Gordon, editors. 2001. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York.
Kitagawa, G. 1987. Non-Gaussian state-space modeling of nonstationary time series. Journal of the American Statistical Association 82:1032–1041.
Pawitan, Y. 2001. In All Likelihood: Statistical Modelling and Inference Using Likelihood. Clarendon Press, Oxford.